Abstract

Word importance discrimination is a task deserving attention when one
processes TREC topics, which are quite long. The goal of the process is to
estimate the importance of words which carry any (additional) information about
user information needs. In our experiments we estimated word importance using
the context information of a word.
1  Introduction


Word importance discrimination is a task deserving attention when one processes
TREC topics, which are quite long. The goal of the process is to estimate the
importance of words which carry any (additional) information about user
information needs. The word importance discrimination task is strongly related
to word filtering, where word importance is a binary value. Several approaches
to word filtering have been proposed. Luhn [3] uses a simple filter based on
the frequency of term occurrence, Bookstein et al. [1] detect content-bearing
words by serial clustering, Picard [4] suggests using term similarities, and
Takayama et al. [6] use an SVD decomposition of the co-occurrence matrix. In
our experiments we estimate word importance based on the context of a word,
which is a probability distribution over the words that occur in the same
documents as the given word. Intuitively, a word is important if it has a
specific meaning in a domain and occurs in a relatively small number of
documents. This suggests that a word has a specific meaning if its context has
low entropy. We use this idea to estimate the importance of both words and
phrases in a document collection and perform experiments to find out which
kind of word importance discrimination gives better results.


2  Methodology 

2.1  Context  Document  Clustering  algorithm  (CDC) 


The Context Document Clustering algorithm is a scalable clustering algorithm;
a full description can be found in [2,5].

In our TREC 2008 experiments we use the idea of a term context, which plays an
important role in the CDC algorithm. Each document of the collection is
represented by a probability distribution over the set of all the terms of the
collection, called the profile of the document:

p(t|d) = \frac{tf_{t,d}}{\sum_{t'} tf_{t',d}},   (1)

where tf_{t,d} is the number of occurrences of term t in document d and
\sum_{t'} tf_{t',d} is the total number of term occurrences in document d. The
occurrence of a term in a document is assumed to be independent of all other
terms of the document. Nothing is assumed about the notion of “term” except
that a document consists of terms and that the set of all the terms of the
collection is the set of terms occurring in at least one document of the
collection. A context is created for each term that is neither very common
(there is an upper bound on the document frequency of the term) nor very rare
(there is a lower bound on the document frequency of the term) in the
collection. The context of a term t is a probability distribution over all the
terms in the collection; each entry of the distribution is the probability of
meeting the corresponding term in the same document as t.
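
The construction of profiles and contexts can be illustrated with a minimal
Python sketch. The profile follows equation (1). The text does not give the
exact estimator for a context, so the sketch assumes that the context of a term
is the pooled term distribution of all documents containing it; the function
names and the parameters min_df and max_df are hypothetical.

```python
from collections import Counter, defaultdict


def profile(doc_terms):
    """Equation (1): relative frequency of each term within one document."""
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def document_frequency(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def term_contexts(docs, min_df, max_df):
    """Context of every term whose document frequency lies in [min_df, max_df].

    Assumption: the context of t is estimated by pooling the term counts of
    all documents that contain t and normalising them to sum to one.
    """
    df = document_frequency(docs)
    pooled = defaultdict(Counter)
    for doc in docs:
        counts = Counter(doc)
        for t in counts:
            if min_df <= df[t] <= max_df:
                pooled[t].update(counts)
    contexts = {}
    for t, counts in pooled.items():
        total = sum(counts.values())
        contexts[t] = {u: c / total for u, c in counts.items()}
    return contexts
```

Each document is passed as a list of terms, so the same functions work for any
notion of “term” the parser produces.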

The term contexts are used to select important words (more precisely,
important terms) from the dictionary of the collection. Intuitively, a word is
important if it has a specific meaning in a domain and occurs in a relatively
small number of documents. In terms of the CDC algorithm, a word has a specific
meaning if its context has low entropy. Hence, we define the importance of a
term in the following way:

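imp(t) = \frac{1}{\log(df(t)) \cdot H(t)},   (2)

where df(t) is the document frequency of term t and H(t) is the entropy of the
context of t.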


The lower the entropy of a term's context, the higher the importance of the
term, and the larger the number of documents containing the term, the lower its
importance. We apply a logarithm to the document frequency because the
difference between occurring in 100 and in 101 documents has a much smaller
effect on the intuitive importance of a term than the difference between
occurring in 1 and in 2 documents.

We use 2- or 3-word sequences in documents as phrases. The importance of a
phrase is the sum of the importances of the terms it is composed of:

imp_1(p) = \sum_{t \in p} imp(t).   (3)
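
A small Python sketch of term and phrase importance follows. It assumes that
equation (2) reads imp(t) = 1 / (log(df(t)) · H(t)), that entropy is computed
with the natural logarithm, and that a phrase term without a context
contributes zero; these choices and all function names are assumptions made
for illustration.

```python
import math


def entropy(distribution):
    """Shannon entropy (natural log) of a {term: probability} mapping."""
    return -sum(p * math.log(p) for p in distribution.values() if p > 0)


def term_importance(context, df):
    """Assumed form of equation (2): 1 / (log(df) * H(context)).

    Requires df > 1 and a non-degenerate context, otherwise the denominator
    is zero. With a lower bound on document frequency this holds for df.
    """
    return 1.0 / (math.log(df) * entropy(context))


def phrases(terms, sizes=(2, 3)):
    """All 2- and 3-term sequences of a term list, as tuples."""
    return [tuple(terms[i:i + n])
            for n in sizes
            for i in range(len(terms) - n + 1)]


def phrase_importance(phrase, imp):
    """Equation (3): sum of the importances of the terms in the phrase.

    Assumption: terms without an importance value contribute 0.
    """
    return sum(imp.get(t, 0.0) for t in phrase)
```

For a context that is uniform over two terms and a document frequency of 25,
term_importance returns 1 / (log 25 · log 2).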


3  Experiments 


We carried out four experiments. Each experiment uses the same methodology
with slight changes, which allows us to compare their results. All our
experiments address the following question: how useful are phrases and
important words extracted automatically from a document collection using
context information, and, in particular, how should the word importance
discrimination defined in (2) and (3) be applied? In the first experiment,
STOeZ, only phrases are used to score documents against topics. In the second
experiment, xLQOW, we mix two types of scores obtained by a document against a
topic: the score obtained from phrases common to the document and the topic,
and the score obtained from common important words. In the third experiment,
Krcy7, we expand the list of phrases with phrases from documents retrieved in
experiment STOeZ. The fourth experiment, U2LwQ, is the same as the first one
except that only the “query” field is used.


3.1  Common part of experiments


The CSIRO document collection is parsed in the following way. First, we delete
stop-words from the documents. In the experiments a word is defined as a string
of alphanumeric symbols containing at least one letter; specifically, a word
satisfies the regular expression “[a-z0-9]*[a-z][a-z0-9]*”. Applying the Porter
stemming algorithm to the words, we obtain stems, which we call terms. The
number of terms obtained is 603349. A document is represented by its profile,
which is the probability distribution defined by (1). We create contexts for
terms with document frequency greater than or equal to 25 and less than or
equal to 10000. Term importance and phrase importance are then calculated.
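
The parsing step can be sketched in Python as follows. The regular expression
is the one quoted above; the stop-word list is a tiny hypothetical stand-in for
the list actually used, and the stemmer is passed in as a function (identity by
default) so that any Porter stemmer implementation can be plugged in.

```python
import re

# A word: alphanumeric characters with at least one letter.
WORD_RE = re.compile(r"[a-z0-9]*[a-z][a-z0-9]*")

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"the", "a", "of", "and", "in", "to", "is"})


def tokenize(text, stem=lambda word: word, stop_words=STOP_WORDS):
    """Lower-case the text, extract words, drop stop-words, stem the rest.

    `stem` is a placeholder for a Porter stemmer; the default leaves words
    unchanged.
    """
    words = WORD_RE.findall(text.lower())
    return [stem(w) for w in words if w not in stop_words]
```

Purely numeric strings such as “2008” do not match the expression and are
dropped, while mixed strings such as “trec2008” are kept as single words.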

When parsing queries, we concatenate the “query” and “narr” fields to form a
topic. We parse the topic by deleting stop-words and applying stemming to
words defined by the same regular expression as for documents. Terms which are
not in the dictionary of the collection are ignored. Each topic contains many
terms of differing importance, which has to be estimated. In the experiments we
test several ways of estimating word importance.


3.2  STOeZ 


In this experiment, a document gains a score against a topic if the document
has phrases in common with the topic:

score_p(d|q) = \sum_{p \in d,\, p \in q} \left( imp_1(p) \cdot \log(pf(p|d) + 1) \right),   (4)

where d is a document, p is a phrase, imp_1(p) is the phrase importance,
pf(p|d) is the number of occurrences of phrase p in document d, and the
summation is over the phrases common to topic q and document d. For each topic
we report the first thousand documents with the highest score_p(d|q) values.
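
A minimal Python sketch of scoring by equation (4) and of selecting the top
documents is given below. Phrases are represented as tuples of terms; dropping
documents with zero score from the ranking is an assumption, as are the
function names.

```python
import math
from collections import Counter


def score_phrases(doc_phrases, topic_phrases, imp1):
    """Equation (4): sum of imp_1(p) * log(pf(p|d) + 1) over common phrases.

    doc_phrases is the list of phrase occurrences in the document, so
    repeated phrases raise pf(p|d).
    """
    pf = Counter(doc_phrases)
    common = set(pf) & set(topic_phrases)
    return sum(imp1.get(p, 0.0) * math.log(pf[p] + 1) for p in common)


def rank(docs, topic_phrases, imp1, limit=1000):
    """Return up to `limit` (doc_id, score) pairs, highest score first.

    Assumption: documents sharing no phrase with the topic are left out.
    """
    scored = ((doc_id, score_phrases(ph, topic_phrases, imp1))
              for doc_id, ph in docs.items())
    kept = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(kept, key=lambda pair: -pair[1])[:limit]
```

The logarithm dampens the effect of repeated phrases: a phrase occurring twice
contributes log 3 rather than twice the weight of a single occurrence.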


3.3  xLQOW 


The scores used in experiment STOeZ are sometimes too strict, and some
relaxation is required. In this experiment we mix document scores obtained with
phrases and with important words. The word-based score is

score_t(d|q) = \sum_{t \in d,\, t \in q} \left( imp(t) \cdot \log(tf(t|d) + 1) \right),   (5)

where d is a document, t is a term, imp(t) is the term importance, tf(t|d) is
the number of occurrences of term t in document d, and the summation is over
the terms common to topic q and document d.

We mix the scores (4) and (5):

score(d|q) = \lambda \cdot score_p(d|q) + (1 - \lambda) \cdot score_t(d|q),

where 0 < \lambda < 1.

We optimized the coefficient \lambda using the topics and relevance judgements
of TREC 2007. The optimal value of \lambda is 0.7.

For each topic we report the first thousand documents with the highest
score(d|q) values.
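
The mixing step can be sketched in Python as below. The term score follows
equation (5) and the mixture uses the reported coefficient 0.7 as default. The
text does not say how the coefficient was optimized, so the grid search shown
here, with a caller-supplied evaluation function, is only an assumed procedure.

```python
import math
from collections import Counter


def score_terms(doc_terms, topic_terms, imp):
    """Equation (5): sum of imp(t) * log(tf(t|d) + 1) over common terms."""
    tf = Counter(doc_terms)
    common = set(tf) & set(topic_terms)
    return sum(imp.get(t, 0.0) * math.log(tf[t] + 1) for t in common)


def mixed_score(score_p, score_t, lam=0.7):
    """Linear mixture of the phrase score and the term score."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return lam * score_p + (1.0 - lam) * score_t


def best_lambda(candidates, evaluate):
    """Assumed tuning procedure: pick the candidate with the best evaluation.

    `evaluate` is a hypothetical callable mapping a coefficient to a quality
    measure computed on held-out topics and relevance judgements.
    """
    return max(candidates, key=evaluate)
```

With the default coefficient, a phrase score of 1.0 and a term score of 2.0
give a mixed score of 0.7 · 1.0 + 0.3 · 2.0 = 1.3.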
3.4  Krcy7


In this experiment we use a different kind of relaxation than in the previous
one. We start from the scores of experiment STOeZ. Consider a topic q and a
document d which has a number of phrases in common with topic q. The common
phrases of document d and topic q are called phrases of level 1.

Let us define the set of documents containing a given phrase p:

D(p) = \{ d \mid p \in d \}.

We weight the level 1 importance of phrase p by the scores of the documents
containing phrase p against all the topics:

imp_2(p) = \sum_{d \in D(p)} \sum_{q} \left( imp_1(p) \cdot score_p(d|q) \right),

where q is a topic, d is a document, p is a phrase occurring in document d, and
imp_1(p) is the importance of the phrase according to (3). We call the result
the “level 2 importance” of the phrase. So, if a document has a phrase of
level 1 against a topic, all of its phrases become phrases of level 2 with a
level 2 importance. We use these phrases to expand the list of phrases to
search for. Note that phrases of level 1 are phrases of level 2, too.

Let us define the set of documents having phrases in common with a topic:

L(q) = \{ d \mid \exists p,\ p \in d,\ p \in q \},

and the scoring function for documents against topics is

score_{p2}(d|q) = \sum_{p \in d,\, p \in q} \left( imp_1(p) \cdot \log(pf(p|d) + 1) \right)
                + \sum_{p \in d,\, d \in L(q)} \left( imp_2(p) \cdot \log(pf(p|d) + 1) \right).

For each topic we report the first thousand documents with the highest
score_{p2}(d|q) values.
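
The level 2 expansion can be sketched in Python as follows. The sketch assumes
that the second sum of the scoring function uses imp_2(p) and runs over all
phrases of a document that belongs to L(q); the data layout (phrase counters
per document, level 1 scores keyed by document and topic) and the function
names are hypothetical.

```python
import math
from collections import Counter


def level2_importance(doc_phrase_counts, imp1, level1_scores):
    """imp_2(p) = sum over d in D(p) and topics q of imp_1(p) * score_p(d|q).

    doc_phrase_counts: {doc_id: Counter of phrase occurrences}
    level1_scores:     {(doc_id, topic_id): score_p(d|q)}
    """
    # Total level 1 score of each document over all topics.
    per_doc = Counter()
    for (doc_id, _topic_id), score in level1_scores.items():
        per_doc[doc_id] += score
    imp2 = Counter()
    for doc_id, pf in doc_phrase_counts.items():
        for p in pf:
            imp2[p] += imp1.get(p, 0.0) * per_doc[doc_id]
    return dict(imp2)


def score_level2(pf, topic_phrases, imp1, imp2):
    """Level 1 score plus the expansion term for documents in L(q)."""
    common = set(pf) & set(topic_phrases)
    if not common:  # the document is not in L(q), so both sums are empty
        return 0.0
    level1 = sum(imp1.get(p, 0.0) * math.log(pf[p] + 1) for p in common)
    expansion = sum(imp2.get(p, 0.0) * math.log(pf[p] + 1) for p in pf)
    return level1 + expansion
```

Because the expansion term ranges over every phrase of a matching document,
documents sharing a single phrase with the topic can still receive large
scores from their other phrases.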


3.5  U2LwQ 


Documents are treated as in the STOeZ experiment. Only the “query” field is
used as a topic. We use all words of the “query” field which occur in the
collection. All the other procedures are the same as in STOeZ. For each topic
we report the first thousand documents with the highest score(d|q) values.

Fig. 1. AP measure on each topic with (left) and without (right) median results
over all systems. Topic ids are placed on the abscissa, and AP measure values
are placed on the ordinate.


4  Experimental  results 


The experimental results are presented in Table 1. The results confirm that
the scores applied in STOeZ are too strict and that the quality of retrieval
can be improved by using important words together with important phrases, as in
experiment xLQOW. The attempt to use important phrases of level 2 does not give
an advantage; see experiment Krcy7.


           STOeZ    xLQOW    Krcy7    U2LwQ    median   best
infAP      0.0723   0.1300   0.0392   0.0339   0.2670   0.5541
infNDCG    0.1538   0.3057   0.1144   0.0742   0.4679   0.7803

Table 1

Average measures over all topics. Median is the average over all the topics of
the median measure of all the participating systems, and best is the average
over all the topics of the best result achieved among all the participating
systems.

Experimental results for each topic are presented in Fig. 1 and Fig. 2. One
can see from the left graphs of Fig. 1 and Fig. 2 that our results are worse
than the median results over all the systems for most topics, but in two cases
for the AP measure and in three cases for the NDCG measure our results are
better than the median results. The right graphs of Fig. 1 and Fig. 2 show that
the relaxed scores applied in experiment xLQOW perform better than the scores
of experiment STOeZ in most cases, but the strict scores of STOeZ and the very
relaxed scores of Krcy7 can perform better than the relaxed scores of xLQOW in
some cases. Considering all the figures and the content of the topics, we found
that better performance, such as exceeding the median results and achieving
values of the AP measure higher than 0.48 and of the NDCG measure higher than
0.78, is reached on short topics containing almost only informative important
words.

Fig. 2. NDCG measure on each topic with (left) and without (right) median
results over all systems. Topic ids are placed on the abscissa, and NDCG
measure values are placed on the ordinate.